Comments for MEDB 5501, Week 3

Count the occurrences of the letter “e”.

A quality control program is easiest
to implement from the top down. 
Make sure that you understand the
commitment of time and money
that is involved. Every workplace is
different, but think about allocating
10% of your time and 10% of the 
time of all your employees to 
quality control.

A practical counting example

Image of a haemocytometer

Measurement error

  • Imprecision in a physical measurement
    • Example: GPS location
      • Can be off by up to 8 meters
      • Worse around large buildings
    • Other examples
      • Weight
      • Body temperature
      • Blood glucose

Reducing measurement error

  • Calibration
  • Consistent environment
  • Good equipment
  • Quality control
  • Training

Errors of validity

  • Mostly used for constructs
  • Types of validity
    • Criterion
      • Concurrent
      • Predictive
    • Content/face
    • Many others
  • Re-establishing validity

Errors of reliability

  • Synonym: repeatability(?)
  • Not reproducibility
  • Both physical measurements and constructs
  • Types of reliability
    • Test-retest
    • Inter-rater
    • Inter-method

Errors due to sampling

  • To be covered later
  • Easiest to quantify
  • Less important in era of big data

Short break

  • What have you learned?
    • Errors
      • Measurement
      • Validity
      • Reliability
      • Sampling
  • What’s next?
    • Descriptive statistics
      • Mean
      • Median

Cartoon image of Professor Mean

Road with a median strip

Calculation of the mean and median

  • Mean
    • Add up all the values, divide by the sample size
  • Median
    • Sort the data
      • Select the middle value if n is odd
      • Go halfway between the two middle values if n is even

Formal mathematical definitions

  • Mean
    • \(\bar{X}=\frac{1}{n}\Sigma X_i\)
  • Median
    • Sorted values \(X_{[1]},X_{[2]},...,X_{[n]}\)
      • \(X_{[(n+1)/2]}\) if n is odd,
      • \((X_{[n/2]}+X_{[n/2+1]})/2\) if n is even
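As a minimal sketch of the two definitions above, the following Python functions compute the mean and the median directly from the formulas. The function names and the small example lists are illustrative assumptions, not part of the course materials.

```python
def mean(values):
    """Add up all the values and divide by the sample size n."""
    return sum(values) / len(values)


def median(values):
    """Middle sorted value if n is odd, else halfway between the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # X_[(n+1)/2] in 1-based notation is index (n-1)//2 in 0-based Python
        return ordered[(n - 1) // 2]
    # X_[n/2] and X_[n/2+1] in 1-based notation are indexes n//2 - 1 and n//2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_example = [5, 1, 3]       # hypothetical values, n is odd
even_example = [5, 1, 3, 7]   # hypothetical values, n is even

print(mean(odd_example), median(odd_example))
print(mean(even_example), median(even_example))
```

The only subtlety is the shift from the 1-based subscripts in the formulas to Python's 0-based list indexes.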

Bacteria before and after A/C upgrade

Room Before  After
 121   11.8   10.1
 125    7.1    3.8
 163    8.2    7.2
 218   10.1   10.5
 233   10.8    8.3
 264   14.0   12.0
 324   14.6   12.1
 325   14     13.7

Before remediation mean

11.8 + 7.1 + 8.2 + 10.1 + 10.8 + 14 + 14.6 + 14 = 90.6

90.6 / 8 = 11.325

Round to 11.3

After remediation mean

10.1 + 3.8 + 7.2 + 10.5 + 8.3 + 12 + 12.1 + 13.7 = 77.7

77.7 / 8 = 9.7125

Round to 9.7

Before remediation median (1/4)

121  11.8

125   7.1

163   8.2

218  10.1

233  10.8

264  14.0

324  14.6

325  14.0

Before remediation median (2/4)

125   7.1

163   8.2

218  10.1

233  10.8

121  11.8

264  14.0

325  14.0

324  14.6

Before remediation median (3/4)

125   7.1  
  
163   8.2  
  
218  10.1  
  
233  10.8  10.8
  
121  11.8  11.8
  
264  14.0  
  
325  14.0  
  
324  14.6  

Before remediation median (4/4)

125   7.1  
  
163   8.2  
  
218  10.1  
  
233  10.8  10.8
                  (10.8 + 11.8) / 2 = 11.3
121  11.8  11.8
  
264  14.0  
  
325  14.0  
  
324  14.6  

After remediation median (1/4)

121  10.1

125   3.8

163   7.2

218  10.5

233   8.3

264  12.0

324  12.1

325  13.7

After remediation median (2/4)

125   3.8

163   7.2

233   8.3

121  10.1

218  10.5

264  12.0

324  12.1

325  13.7

After remediation median (3/4)

125   3.8  
  
163   7.2  
  
233   8.3  
  
121  10.1  10.1
  
218  10.5  10.5
  
264  12.0  
  
324  12.1  
  
325  13.7  

After remediation median (4/4)

125   3.8  
  
163   7.2  
  
233   8.3  
  
121  10.1  10.1
                  (10.1 + 10.5) / 2 = 10.3
218  10.5  10.5
  
264  12.0  
  
324  12.1  
  
325  13.7  
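The hand calculations of the means and medians for the room data can be cross-checked with Python's standard `statistics` module. This is only a sketch for checking arithmetic; the variable names are assumptions, and the course itself uses SPSS.

```python
import statistics

# Bacteria values from the table "Bacteria before and after A/C upgrade"
before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]
after = [10.1, 3.8, 7.2, 10.5, 8.3, 12.0, 12.1, 13.7]

for label, values in (("Before", before), ("After", after)):
    mean_value = statistics.mean(values)      # sum of values / n
    median_value = statistics.median(values)  # n = 8 is even: average the middle two
    print(f"{label}: mean = {mean_value:.1f}, median = {median_value:.1f}")
```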

Criticisms of the mean and median

  • Are you combining apples and onions?
  • Are you ignoring minorities?

Excerpt from Gould 1985 publication

Choosing between the mean and median

  • Often, either is fine
  • When do you use the mean?
    • When totals are important
    • “In 2020, the average expenditure by the Italian National Health Service (Servizio Sanitario Nazionale, SSN) per patient affected by at least one chronic disease was approximately 696 euros.”
  • When do you use the median?
    • When outliers/skewness might distort your conclusions

Chen et al 2019

Chen 2019, PMID: 31806195 (continued)

Background: The prices of newly approved cancer drugs have risen over the past decades. A key policy question is whether the clinical gains offered by these drugs in treating specific cancer indications justify the price increases.

Chen 2019, PMID: 31806195 (continued)

Results: We found that between 1995 and 2012, price increases outstripped median survival gains, a finding consistent with previous literature. Nevertheless, price per mean life-year gained increased at a considerably slower rate, suggesting that new drugs have been more effective in achieving longer-term survival. Between 2013 and 2017, price increases reflected equally large gains in median and mean survival, resulting in a flat profile for benefit-adjusted launch prices in recent years.

Break

  • What have you just learned?
    • Criticisms of the mean and median
  • What is coming next?
    • Computing percentiles

Illustration of the 75th percentile

Computing percentiles

  • Many formulas
    • Differences are not worth fighting over
  • My preference (pth quantile)
    • Sort the data
    • Calculate p*(n+1)
    • Is it a whole number?
      • Yes: select that value
      • No: go halfway between the two neighboring values
      • Special cases: if p(n+1) < 1, use the smallest value; if p(n+1) > n, use the largest value

Some examples of percentile calculations

  • Example for n=39
    • For 5th percentile, p(n+1)=2 -> 2nd smallest value
    • For 4th percentile, p(n+1)=1.6 -> halfway between two smallest values
    • For 2nd percentile, p(n+1)=0.8 -> smallest value
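The p(n+1) rule and the three n=39 examples above can be sketched in Python as follows. The function name `percentile` and the data set (the integers 1 to 39, chosen so that the kth smallest value is simply k) are hypothetical. Returning the largest value when p(n+1) > n is an assumption that mirrors the "smallest value" case shown in the examples.

```python
import math


def percentile(values, p):
    """pth quantile (0 < p < 1) using the p*(n+1) rule with 'go halfway' in between."""
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1)               # 1-based position in the sorted data
    if position < 1:
        return ordered[0]                # special case: smallest value
    if position > n:
        return ordered[-1]               # special case (assumed): largest value
    nearest = round(position)
    if abs(position - nearest) < 1e-9:   # whole number, allowing for float rounding
        return ordered[nearest - 1]
    lower = math.floor(position)
    # Not a whole number: go halfway between the two neighboring values
    return (ordered[lower - 1] + ordered[lower]) / 2


data = list(range(1, 40))  # hypothetical data, n = 39

print(percentile(data, 0.05))  # p(n+1) = 2   -> 2nd smallest value
print(percentile(data, 0.04))  # p(n+1) = 1.6 -> halfway between the two smallest
print(percentile(data, 0.02))  # p(n+1) = 0.8 -> smallest value
```

The small tolerance in the whole-number test guards against floating-point products such as 0.05 × 40 landing a hair away from 2.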

Some terminology

  • Percentile: goes from 0% to 100%
  • Quantile: goes from 0.0 to 1.0
    • 90th percentile = 0.9 quantile
  • 25th, 50th, and 75th percentiles: quartiles
    • 25th percentile: \(Q_1,\ X_{0.25}\) or lower quartile
    • Median/50th percentile: \(Q_2\) or \(X_{0.5}\)
    • 75th percentile: \(Q_3,\ X_{0.75}\) or upper quartile

Before remediation upper quartile (1/4)

121  11.8

125   7.1

163   8.2

218  10.1

233  10.8

264  14.0

324  14.6

325  14.0

Before remediation upper quartile (2/4)

125   7.1

163   8.2

218  10.1

233  10.8

121  11.8

264  14.0

325  14.0

324  14.6

Before remediation upper quartile (3/4)

125   7.1  
  
163   8.2  
  
218  10.1  
  
233  10.8  
  
121  11.8  
  
264  14.0  14
  
325  14.0  14
  
324  14.6  

Before remediation upper quartile (4/4)

125   7.1  
  
163   8.2  
  
218  10.1  
  
233  10.8  
  
121  11.8  
  
264  14.0  14
                  (14 + 14) / 2 = 14
325  14.0  14
  
324  14.6  

After remediation upper quartile (1/4)

121  10.1

125   3.8

163   7.2

218  10.5

233   8.3

264  12.0

324  12.1

325  13.7

After remediation upper quartile (2/4)

125   3.8

163   7.2

233   8.3

121  10.1

218  10.5

264  12.0

324  12.1

325  13.7

After remediation upper quartile (3/4)

125   3.8  
  
163   7.2  
  
233   8.3  
  
121  10.1  
  
218  10.5  
  
264  12.0  12
  
324  12.1  12.1
  
325  13.7  

After remediation upper quartile (4/4)

125   3.8  
  
163   7.2  
  
233   8.3  
  
121  10.1  
  
218  10.5  
  
264  12.0  12
                  (12 + 12.1) / 2 = 12.05
324  12.1  12.1
  
325  13.7  
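The upper quartile walk-through can be reproduced in a few lines. The sketch below applies the halfway rule for n = 8 (p(n+1) = 0.75 × 9 = 6.75, so average the 6th and 7th sorted values) and, for comparison, calls Python's `statistics.quantiles`, which interpolates linearly rather than going halfway. The helper name `upper_quartile_halfway` is an assumption made for illustration.

```python
import statistics

before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]
after = [10.1, 3.8, 7.2, 10.5, 8.3, 12.0, 12.1, 13.7]


def upper_quartile_halfway(values):
    """Q3 by the halfway rule; this sketch only handles a non-whole position."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) + 1)   # 6.75 when n = 8
    lower = int(position)                  # 6
    assert position != lower and lower < len(ordered)
    return (ordered[lower - 1] + ordered[lower]) / 2


for label, values in (("Before", before), ("After", after)):
    halfway = upper_quartile_halfway(values)
    # method="exclusive" also uses position p*(n+1) but interpolates linearly
    interpolated = statistics.quantiles(values, n=4, method="exclusive")[2]
    print(f"{label}: halfway rule = {halfway:.3f}, linear interpolation = {interpolated:.3f}")
```

For the after-remediation data the two rules differ only in the second decimal place, which is consistent with the point that the differences among percentile formulas are not worth fighting over.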

When you should use percentiles

  • Characterize variation
    • Middle 50% of the data
  • Exposure issues
    • Not enough to control median exposure level
  • Quantify extremes
    • What does “upper class” mean?
  • Quality control
    • Almost all products must meet a minimum standard

Break

  • What have you just learned?
    • Computing percentiles
  • What is coming next?
    • Computing the standard deviation

Standard deviation

\[S = \sqrt{\frac{1}{n-1}\Sigma(X_i-\bar{X})^2}\]

At least one alternative formula.
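A minimal sketch of the standard deviation formula above, applied to the before-remediation bacteria values from earlier in these notes. The function name `sample_sd` is an assumption; the result is compared with `statistics.stdev`, which also uses the n - 1 divisor.

```python
import math
import statistics

before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]


def sample_sd(values):
    """S = sqrt( (1 / (n - 1)) * sum of (X_i - X-bar)^2 )."""
    n = len(values)
    x_bar = sum(values) / n
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))


print(round(sample_sd(before), 2))
print(round(statistics.stdev(before), 2))  # should agree with the line above
```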

Why is variation important?

  • Variation = Noise
    • Too much noise can hide signals
  • Variation = Heterogeneity
    • Too little heterogeneity, hard to generalize
    • Too much heterogeneity, mixing apples and oranges
  • Variation = Unpredictability
    • Too much unpredictability, hard to prepare for the future
  • Variation = Risk
    • Too much risk can create a financial burden

Should you try to minimize variation?

  • Yes, for early studies
    • Easier to detect signals
    • Proof of concept trials
  • No, for later studies
    • Easier to generalize results
    • Pragmatic trials

The bell shaped curve

  • Does your variation follow a bell shaped curve?
  • Synonyms: normality, normal distribution
    • Values in the middle are most common
    • Frequencies taper off away from the center
    • Symmetry on either side
  • A bell shaped curve = better characterization of variation

Bimodal histogram, not a bell shaped curve

Skewed histogram, not a bell shaped curve

Uniform histogram, not a bell shaped curve

Heavy-tailed histogram, not a bell shaped curve

Bell-shaped histogram, finally!

Why concern yourself with the bell shaped curve?

  • You can characterize individual observations
  • You can characterize summary measures

Percentage within one standard deviation

Percentage within two standard deviations

Percentage within three standard deviations
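For an exact bell shaped curve, the percentages within one, two, and three standard deviations can be computed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library; the helper name `within_k_sd` is an assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def within_k_sd(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} SD: {100 * within_k_sd(k):.1f}%")
```

These are the roughly 68%, 95%, and 99.7% figures; real data that only approximately follow a bell shaped curve will deviate from them.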

Behavior of the mean versus an individual

  • Central Limit Theorem
    • Sample mean is approximately normal
    • Even if individual observations are not
  • Standard error: \(S/\sqrt{n}\)
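A small simulation can illustrate the behavior of the mean. The sketch below draws skewed (exponential) observations, computes many sample means, and compares their spread with the standard error. Everything here is an assumption made for illustration: the exponential distribution, the seed, n = 30, and 2,000 replicates. Because the data are simulated, the known population standard deviation (1) stands in for S.

```python
import math
import random
import statistics

rng = random.Random(5501)   # arbitrary fixed seed so the run is repeatable
n = 30                      # assumed sample size
replicates = 2000           # assumed number of simulated samples

sample_means = []
for _ in range(replicates):
    # Individual observations are strongly skewed, not bell shaped
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

observed_spread = statistics.stdev(sample_means)
expected_spread = 1.0 / math.sqrt(n)   # standard error with population SD = 1

print(f"SD of the sample means: {observed_spread:.3f}")
print(f"Standard error formula: {expected_spread:.3f}")
```

A histogram of `sample_means` would look far more symmetric than a histogram of the individual observations, which is the practical content of the Central Limit Theorem.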

Diagnosing distributional issues (1/2)

  • For all data
    • \(\bar{X} \gg X_{0.5}\)
    • \(\bar{X}\) and/or \(X_{0.5}\) not midway between \(Q_1\) and \(Q_3\)
    • \(\bar{X}\) and/or \(X_{0.5}\) not midway between min and max

Diagnosing distributional issues (2/2)

  • For non-negative data
    • \(S > 0.5 \times \bar{X}\)
  • For data with a lower and/or upper bound
    • \(Q_1\) = lower bound
    • \(Q_3\) = upper bound
  • Don’t overdiagnose, especially with small sample sizes!

Lin et al 2022, PMID: 36126916

Excerpt from Table 1 of Lin et al 2022: ages

Excerpt from Table 1 of Lin et al 2022: CCI

Excerpt from Table 1 of Lin et al 2022: PHQ-2

Tosato et al 2021, PMID: 34352201

Tosato 2021, PMID: 34352201 (continued)

Symptom persistence weeks after laboratory-confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) clearance is a relatively common long-term complication of Coronavirus disease 2019 (COVID-19). Little is known about this phenomenon in older adults. The present study aimed at determining the prevalence of persistent symptoms among older COVID-19 survivors and identifying symptom patterns.

Tosato 2021, PMID: 34352201 (continued)

The mean age was 73.1 ± 6.2 years (median 72, interquartile range 27), and 63 (38.4%) were women. The average time elapsed from hospital discharge was 76.8 ± 20.3 days (range 25-109 days).

Ielapi et al 2021, PMID: 34968328

Ielapi 2021, PMID: 34968328 (continued)

Background. Insomnia is one of the major health problems related with a decrease in quality of life (QOL) and also in poor functioning in night-shift nurses, that also may negatively affect patients’ care. The aim of this study is to evaluate the prevalence of insomnia in night shift nurses.

Ielapi 2021, PMID: 34968328 (continued)

Excerpt from Table 1. Data reported as mean ± standard deviation or median [Q1-Q3]

                                   Overall (n = 2′355)
Age, years                         40.4 ± 10.3
Months of work                     168 [72–300]
Night shifts per month, number     6.3 ± 1.4
Time to reach workplace, minutes   45 [45–65]

Ielapi 2021, PMID: 34968328 (continued)

Excerpt from Table 1. Data reported as mean ± standard deviation or median [Q1-Q3]

Rest time, minutes                          180 [4–240]
Rest in the afternoon, minutes              30 [0–120]
Number of coffees, mean                     2.5 ± 1.5
Number of coffees during night shift, mean  1.4 ± 1.1
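One of the rules of thumb from the "Diagnosing distributional issues" slides, S > 0.5 × mean for non-negative data, can be applied to the variables that Table 1 reports as mean ± standard deviation. This is a rough sketch: the dictionary and function names are assumptions, and a flag only suggests a closer look, in keeping with the warning not to overdiagnose.

```python
# (mean, standard deviation) pairs copied from the Table 1 excerpts
table_1 = {
    "Age, years": (40.4, 10.3),
    "Night shifts per month, number": (6.3, 1.4),
    "Number of coffees": (2.5, 1.5),
    "Number of coffees during night shift": (1.4, 1.1),
}


def sd_exceeds_half_mean(mean, sd):
    """Rule of thumb for non-negative data: is S > 0.5 * mean?"""
    return sd > 0.5 * mean


flagged = [
    name for name, (mean, sd) in table_1.items()
    if sd_exceeds_half_mean(mean, sd)
]

for name in flagged:
    print(f"Worth a closer look: {name}")
```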

Normal probabilities in SPSS (P[Z < 2.5]=?)

Screenshot of SPSS dialog box

Normal probabilities in SPSS (P[Z < 2.5]=0.9938)

Screenshot of SPSS data window

Normal probabilities in SPSS (P[Z > 2.5]=?)

Screenshot of SPSS dialog box

Normal probabilities in SPSS (P[Z > 2.5]=0.0062)

Screenshot of SPSS data window

Normal probabilities in SPSS (P[-2.5 < Z < 2.5]=?)

Screenshot of SPSS dialog box

Normal probabilities in SPSS (P[-2.5 < Z < 2.5]=0.9876)

Screenshot of SPSS data window
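For readers without SPSS at hand, the three normal probabilities above can be reproduced with Python's standard library. This is an alternative sketch, not the SPSS procedure shown in the screenshots; the variable names are assumptions.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_below = z.cdf(2.5)                  # P[Z < 2.5]
p_above = 1 - z.cdf(2.5)              # P[Z > 2.5]
p_between = z.cdf(2.5) - z.cdf(-2.5)  # P[-2.5 < Z < 2.5]

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```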

Normal percentiles in SPSS (P[Z < ?]=0.75)

Screenshot of SPSS dialog box

Normal percentiles in SPSS (P[Z < 0.67]=0.75)

Screenshot of SPSS data window

Normal percentiles in SPSS (P[Z < ?]=0.25)

Screenshot of SPSS dialog box

Normal percentiles in SPSS (P[Z < -0.67]=0.25)

Screenshot of SPSS data window
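The two normal percentiles above can be checked the same way, using the inverse of the cumulative distribution function. Again, this is a Python sketch offered as a cross-check rather than the SPSS steps in the screenshots.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal

upper = z.inv_cdf(0.75)  # value with 75% of the distribution below it
lower = z.inv_cdf(0.25)  # value with 25% of the distribution below it

print(round(upper, 2), round(lower, 2))
```

Because the standard normal curve is symmetric around zero, the 25th percentile is the negative of the 75th percentile.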

Normal probability plot in SPSS (1/2)

Screenshot of SPSS dialog box

Normal probability plot in SPSS (2/2)

Screenshot of SPSS data window

Standardizing data in SPSS (1/2)

Screenshot of SPSS dialog box

Standardizing data in SPSS (2/2)

Screenshot of SPSS data window
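Standardizing means subtracting the mean from each value and dividing by the standard deviation. The sketch below does this for the before-remediation bacteria values from earlier in these notes, assuming the n - 1 sample standard deviation; the variable names are illustrative.

```python
import statistics

before = [11.8, 7.1, 8.2, 10.1, 10.8, 14.0, 14.6, 14.0]

x_bar = statistics.mean(before)
s = statistics.stdev(before)   # sample standard deviation (n - 1 divisor)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - x_bar) / s for x in before]

for x, z in zip(before, z_scores):
    print(f"{x:5.1f} -> z = {z:+.2f}")
```

After standardizing, the values have a mean of 0 and a standard deviation of 1, which makes them directly comparable with standard normal probabilities.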

What is a population?

  • Population: a group that you wish to generalize your research results to. It is defined in terms of
    • Demography,
    • Geography,
    • Occupation,
    • Time,
    • Care requirements,
    • Diagnosis,
    • Or some combination of the above.

Example of a population

All infants born in the state of Missouri during the 1995 calendar year who have one or more visits to the emergency room during their first year of life.

What is a sample?

  • Sample: subset of a population.
  • Random sample: every person has the same probability of being in the sample.
  • Biased sample: Some people have a decreased probability of being in the sample.
    • Always ask “who was left out?”

An example of a biased sample

  • A researcher wants to characterize illicit drug use in teenagers. She distributes a questionnaire to students attending a local public high school.
  • (In the U.S., high school is grades 9-12, which is mostly students from ages 14 to 18.)
  • Explain how this sample is biased.
  • Who has a decreased or even zero probability of being selected?

Type your ideas in the chat box.

Fixing a biased sample

  • Redefine your population
    • Not all teenagers,
      • but those attending public high schools.

What is a parameter?

  • A parameter is a number computed from a population.
    • Examples
      • Average health care cost associated with the 29,637 children
      • Proportion of these 29,637 children who died in their first year of life.
      • Correlation between gestational age and number of ER visits of these 29,637 children.
    • Designated by Greek letters (\(\mu\), \(\pi\), \(\rho\))

What is a statistic?

  • A statistic is a number computed from a sample
    • Examples
      • Average health care cost associated with 100 children.
      • Proportion of these 100 children who died in their first year of life.
      • Correlation between gestational age and number of ER visits of these 100 children.
    • Designated by non-Greek letters (\(\bar{X}\), \(\hat{p}\), r).
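The distinction between a parameter and a statistic can be illustrated with a simulation. In the sketch below the population of 29,637 cost values is entirely hypothetical (simulated, not real health care data), as are the seed and the lognormal shape. The only point is that the parameter uses every member of the population while the statistic uses a random sample of 100.

```python
import random
import statistics

rng = random.Random(1995)  # arbitrary fixed seed so the run is repeatable

# HYPOTHETICAL population: simulated costs for 29,637 children (not real data)
population = [rng.lognormvariate(7.0, 1.0) for _ in range(29_637)]

# Parameter: computed from the whole population (mu)
mu = statistics.mean(population)

# Statistic: computed from a random sample of 100 children (X-bar)
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"Parameter mu    = {mu:.0f}")
print(f"Statistic X-bar = {x_bar:.0f}")
```

Rerunning with a different seed changes the statistic from sample to sample, while the parameter for a given population stays fixed.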

What is Statistics?

  • Statistics
    • The use of information from a sample (a statistic) to make inferences about a population (a parameter)
      • Often a comparison of two populations

The median is not the message

Q15, Q20